

Search for: All records

Creators/Authors contains: "Zhang, Yifan"


  1. With the emergence of wearable devices and other embedded systems, deploying large language models (LLMs) on edge platforms has become an urgent need. However, this is challenging because of their high computational and memory demands. Although recent low-bit quantization methods (e.g., BitNet, DeepSeek) compress weights to as low as 1.58 bits with minimal accuracy loss, edge deployment is still constrained by limited on-chip resources, power budgets, and the often-neglected long latency of the prefill stage. We present TeLLMe, the first table-lookup-based ternary LLM accelerator for low-power edge FPGAs that fully supports both prefill and autoregressive decoding using 1.58-bit weights and 8-bit activations. TeLLMe incorporates several novel techniques, including (1) a table-lookup-based ternary matrix multiplication (TLMM) engine utilizing grouped activations and online precomputation for low resource utilization and high throughput; (2) a fine-grained analytic URAM-based weight buffer management scheme for efficient loading and compute engine access; (3) a streaming dataflow architecture that fuses floating-point element-wise operations with linear computations to hide latency; (4) a reversed-reordered prefill stage attention with fused attention operations for high memory efficiency; and (5) a resource-efficient specialized decoding stage attention. Under a 5 W power budget, TeLLMe delivers up to 25 tokens/s decoding throughput and 0.45–0.96 s time-to-first-token (TTFT) for 64–128 token prompts, marking a significant energy-efficiency advancement in LLM inference on edge FPGAs.
    Free, publicly-accessible full text available October 21, 2026
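    The table-lookup idea behind the TLMM engine in item 1 can be illustrated in software. The following is only a minimal sketch under stated assumptions, not the accelerator's actual design: the group size of 3, the base-3 index encoding, and the function names `build_table`, `encode`, and `tl_matvec` are all hypothetical. For each group of activations, every possible ternary-weighted partial sum is precomputed once, and each output row then needs only one lookup per group instead of per-element multiplications.

```python
from itertools import product

GROUP = 3  # hypothetical group size; the abstract does not state one


def build_table(act_group):
    """Precompute all 3**G partial sums for one activation group."""
    return [sum(w * a for w, a in zip(combo, act_group))
            for combo in product((-1, 0, 1), repeat=len(act_group))]


def encode(weight_group):
    """Pack ternary weights {-1, 0, +1} into a base-3 table index.

    The first weight is the most significant digit, matching the
    ordering produced by itertools.product in build_table.
    """
    idx = 0
    for w in weight_group:
        idx = idx * 3 + (w + 1)
    return idx


def tl_matvec(weights, acts, group=GROUP):
    """Ternary matrix-vector product using lookups only."""
    assert len(acts) % group == 0, "activation length must be a multiple of the group size"
    starts = range(0, len(acts), group)
    # One table per activation group, shared by every weight row.
    tables = [build_table(acts[s:s + group]) for s in starts]
    # Weight indices would normally be encoded once, offline.
    codes = [[encode(row[s:s + group]) for s in starts] for row in weights]
    return [sum(tables[k][c] for k, c in enumerate(row_codes))
            for row_codes in codes]
```

    Because each table is shared across all weight rows, the cost of building it is amortized over the output dimension, which is the property that makes the approach attractive when multipliers are scarce.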
  2. Transformer-based models have demonstrated superior performance in various fields, including natural language processing and computer vision. However, their enormous model size and high demands in computation, memory, and communication limit their deployment to edge platforms for local, secure inference. Binary transformers offer a compact, low-complexity solution for edge deployment with reduced bandwidth needs and acceptable accuracy. However, existing binary transformers perform inefficiently on current hardware due to the lack of binary-specific optimizations. To address this, we introduce COBRA, an algorithm-architecture co-optimized binary Transformer accelerator for edge computing. COBRA features a real 1-bit binary multiplication unit, enabling matrix operations with -1, 0, and +1 values, surpassing ternary methods. With further hardware-friendly optimizations in the attention block, COBRA achieves up to 3,894.7 GOPS throughput and 448.7 GOPS/Watt energy efficiency on edge FPGAs, delivering a 311× energy efficiency improvement over GPUs and a 3.5× throughput improvement over the state-of-the-art binary accelerator, with only negligible inference accuracy degradation.
    Free, publicly-accessible full text available October 27, 2026
  3. Free, publicly-accessible full text available October 1, 2026
  4. Path planning is a critical task for autonomous driving, aiming to generate smooth, collision-free, and feasible paths based on input perception and localization information. The planning task is both highly time-sensitive and computationally intensive, posing significant challenges to resource-constrained autonomous driving hardware. In this article, we propose an end-to-end framework for accelerating path planning on FPGA platforms. This framework focuses on accelerating quadratic programming (QP) solving, which is the core of optimization-based path planning and its most computationally intensive workload. Our method leverages a hardware-friendly alternating direction method of multipliers (ADMM) to solve QP problems while employing a highly parallelizable preconditioned conjugate gradient (PCG) method for solving the associated linear systems. We analyze the sparsity patterns of matrix operations in QP and design customized storage schemes along with efficient sparse matrix multiplication and sparse matrix-vector multiplication units. Our customized design significantly reduces resource consumption for data storage and computation while dramatically speeding up matrix operations. Additionally, we propose a multi-level dataflow optimization strategy. Within individual operators, we achieve acceleration through parallelization and pipelining. For different operators in an algorithm, we analyze inter-operator data dependencies to enable fine-grained pipelining. At the system level, we map different steps of the planning process to the CPU and FPGA and pipeline these steps to enhance end-to-end throughput. We implement and validate our design on the AMD ZCU102 platform. Our implementation achieves state-of-the-art performance in both latency and energy efficiency compared with existing works, including an average 1.48× speedup over the best FPGA-based design, a 2.89× speedup compared with the state-of-the-art QP solver on an Intel i7-11800H CPU, a 5.62× speedup over an ARM Cortex-A57 embedded CPU, and a 1.56× speedup over state-of-the-art GPU-based work. Furthermore, our design delivers a 2.05× improvement in throughput compared with the state-of-the-art FPGA-based design.
    Free, publicly-accessible full text available September 30, 2026
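    Item 4 relies on a preconditioned conjugate gradient (PCG) method for the linear systems that arise inside ADMM. The sketch below shows what such a solver looks like in plain Python. It is an illustration only: the abstract does not name the preconditioner, so the Jacobi (diagonal) preconditioner used here is an assumption, the matrix is dense rather than stored in the customized sparse scheme, and the names `pcg`, `dot`, and `matvec` are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def matvec(A, v):
    """Dense matrix-vector product; a sparse format would replace this."""
    return [dot(row, v) for row in A]


def pcg(A, b, tol=1e-10, max_iter=100):
    """Solve A x = b for symmetric positive-definite A.

    Uses a Jacobi preconditioner (assumed here, not taken from the paper):
    M^-1 is the inverse of the diagonal of A.
    """
    n = len(b)
    m_inv = [1.0 / A[i][i] for i in range(n)]
    x = [0.0] * n
    r = list(b)                           # residual b - A x with x = 0
    z = [m_inv[i] * r[i] for i in range(n)]
    p = list(z)                           # first search direction
    rz = dot(r, z)
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 < tol:        # converged
            break
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)           # step length along p
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        z = [m_inv[i] * r[i] for i in range(n)]
        rz_new = dot(r, z)
        beta = rz_new / rz                # keeps directions A-conjugate
        p = [z[i] + beta * p[i] for i in range(n)]
        rz = rz_new
    return x
```

    Every step is a dot product, a scaled vector update, or one matrix-vector product, which is why the method parallelizes and pipelines well in hardware.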
  5. Free, publicly-accessible full text available June 1, 2026
  6. Attracting students to computing is crucial for advancing the development of new skills and fostering positive attitudes toward the field, especially among females and minoritized populations. One promising approach involves integrating computing with artistic activities, such as music. This study examines how learners' prior experiences influence their participation in a virtual summer camp on coding with music. The study also examines how participation in the camp influences participants' attitudes about computing, with an eye toward gender differences. Data were collected through participant surveys (N=73) and focus groups (N=48). Findings suggest that parents' and guardians' involvement is crucial for participation and that integrating coding with artistic work holds promise for attracting students to the field. Findings can inform possible paths to engaging students in computing.
    Free, publicly-accessible full text available April 30, 2026
  7. Previous research has demonstrated significant inter-individual variability in the recruitment of the fast-explicit and slow-implicit processes during motor adaptation. In addition, we previously identified qualitative individual differences in adaptation linked to the formation and updating of new memory processes. Here, we investigated quantitative and qualitative differences in visuomotor adaptation with a design incorporating repeated learning and forgetting blocks, allowing for precise estimation of individual learning and forgetting rates in fast-slow adaptation models. Participants engaged in a two-day online visuomotor adaptation task. They first adapted to a 30-degree perturbation to eight targets in three blocks separated by short blocks of no-feedback trials. Approximately 24 hours later, they performed a no-feedback retention block and a relearning block. We clustered the participants into strong and weak learners based on adaptation levels at the end of day one and fitted a fast-slow system to the adaptation data. Strong learners exhibited a strong negative correlation between the estimated slow and fast processes, which predicted 24-hour retention and savings, respectively, supporting the engagement of a fast-slow system. The pronounced individual differences in the recruitment of the two processes were attributed to wide ranges of estimated learning rates. Conversely, weak learners exhibited a positive correlation between the two estimated processes, as well as retention but no savings, supporting the engagement of a single slow system. Finally, both during baseline and adaptation, reaction times were shorter for weak learners. Our findings thus revealed two distinct ways to learn in visuomotor adaptation and highlight the necessity of considering both quantitative and qualitative individual differences in studies of motor learning.
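    The fast-slow system fitted in item 7 is not written out in the abstract. A common formulation in the motor adaptation literature is a two-state model in which each process retains a fraction of its state from trial to trial and learns from the current error. The sketch below assumes that formulation; the function name `simulate` and all four rate values are hypothetical placeholders, not estimates from the study. On no-feedback trials the error term is set to zero, so both states simply decay, which is what lets such blocks separate forgetting rates from learning rates.

```python
def simulate(perturbations, feedback,
             a_fast=0.80, b_fast=0.30,   # hypothetical fast retention / learning rates
             a_slow=0.99, b_slow=0.05):  # hypothetical slow retention / learning rates
    """Simulate an assumed two-state (fast + slow) adaptation model.

    perturbations: imposed rotation on each trial, in degrees.
    feedback:      True if the error was visible on that trial.
    Returns one (output, fast_state, slow_state) tuple per trial,
    recorded before that trial's update is applied.
    """
    x_fast = 0.0
    x_slow = 0.0
    trace = []
    for p, has_feedback in zip(perturbations, feedback):
        output = x_fast + x_slow
        trace.append((output, x_fast, x_slow))
        # Without feedback there is no error signal to learn from.
        error = (p - output) if has_feedback else 0.0
        x_fast = a_fast * x_fast + b_fast * error
        x_slow = a_slow * x_slow + b_slow * error
    return trace


# Usage: 100 trials of a 30-degree rotation, then 20 no-feedback trials.
rotation = [30.0] * 100 + [0.0] * 20
visible = [True] * 100 + [False] * 20
trace = simulate(rotation, visible)
```

    With these placeholder rates, the fast state collapses within a few no-feedback trials while the slow state persists, mirroring the abstract's link between the slow process and retention.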
  8. Preparing long-range entangled states poses significant challenges for near-term quantum devices. It is known that measurement and feedback (MF) can aid this task by allowing the preparation of certain paradigmatic long-range entangled states with only constant circuit depth. Here, we systematically explore the structure of states that can be prepared using constant-depth local circuits and a single MF round. Using the framework of tensor networks, the preparability under MF translates to tensor symmetries. We detail the structure of matrix-product states (MPSs) and projected entangled-pair states (PEPSs) that can be prepared using MF, revealing the coexistence of Clifford-like properties and magic. In one dimension, we show that states with Abelian-symmetry-protected topological order are a restricted class of MF-preparable states. In two dimensions, we parametrize a subset of states with Abelian topological order that are MF preparable. Finally, we discuss the analogous implementation of operators via MF, providing a structural theorem that connects to the well-known Clifford teleportation. Published by the American Physical Society, 2024.
  9. The demand for clean energy production and storage has increased interest in molten salt technologies, including molten salt reactors (MSRs). Understanding how molten salt properties change with temperature and structure is vital to establishing efficient, cost-effective MSR systems. Research into these materials, however, has been limited by the difficulty of accurately measuring the properties of these reactive materials at elevated temperatures and in a controlled environment in a time-efficient way. Much research has turned to molecular dynamics (MD) modeling to alleviate these issues. This research presents a custom-fabricated falling-ball viscometer system for measuring molten salt viscosity quickly. A model for correlating velocity to viscosity for Re < 300 was also developed for use with this system. The viscometer is demonstrated on eutectic FLiNaK and NaF-ZrF4 (53–47 mol%) up to 150 K above the respective melting points. The results are compared with previously reported measurements and with MD simulations to verify the simulations' effectiveness for predicting viscosity.
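    The velocity-to-viscosity correlation for Re < 300 mentioned in item 9 is not given in the abstract, so it is not reproduced here. As a point of reference only, the sketch below uses the classical Stokes relation for a sphere at terminal velocity, which holds only in the creeping-flow limit (Re much less than 1), and computes the Reynolds number so that the validity of the estimate can be checked. The function names and the numbers in the usage lines are hypothetical and are not measurements of FLiNaK or NaF-ZrF4.

```python
def stokes_viscosity(radius_m, rho_ball, rho_fluid, velocity_m_s, g=9.81):
    """Dynamic viscosity (Pa*s) from terminal velocity via Stokes' law.

    mu = 2 r^2 g (rho_ball - rho_fluid) / (9 v)
    Valid only for creeping flow; wall effects are ignored.
    """
    return 2.0 * radius_m ** 2 * g * (rho_ball - rho_fluid) / (9.0 * velocity_m_s)


def reynolds_number(radius_m, rho_fluid, velocity_m_s, mu):
    """Reynolds number based on ball diameter."""
    return rho_fluid * velocity_m_s * (2.0 * radius_m) / mu


# Usage with made-up values: a 1 mm radius ball falling at 1 cm/s.
mu = stokes_viscosity(radius_m=1e-3, rho_ball=8000.0,
                      rho_fluid=2000.0, velocity_m_s=0.01)
re = reynolds_number(radius_m=1e-3, rho_fluid=2000.0,
                     velocity_m_s=0.01, mu=mu)
stokes_is_valid = re < 1.0  # outside this range a drag correction is needed
```

    When the Reynolds number is not small, the Stokes estimate overstates the viscosity, which is why a correlation extending to higher Re, such as the one the abstract describes, is needed for fast-falling balls in low-viscosity melts.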